{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# XGBoost Parameter Tuning for Happy Customer Bank"
   ]
  },
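  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data description:\n",
    "The dataset has 26 fields: columns 1-24 are input features and columns 25-26 are outputs.\n",
    "    1. ID - unique ID (must not be used for prediction)\n",
    "    2. Gender - gender\n",
    "    3. City - city\n",
    "    4. Monthly_Income - monthly income (in rupees)\n",
    "    5. DOB - date of birth\n",
    "    6. Lead_Creation_Date - date the (loan) lead was created\n",
    "    7. Loan_Amount_Applied - loan amount requested in the application (Indian rupees, INR)\n",
    "    8. Loan_Tenure_Applied - loan tenure requested in the application (in years)\n",
    "    9. Existing_EMI - EMI of existing loans (EMI: equated monthly installment)\n",
    "    10. Employer_Name - employer name\n",
    "    11. Salary_Account - bank holding the salary account\n",
    "    12. Mobile_Verified - whether the mobile number is verified (Y / N)\n",
    "    13. Var5 - continuous variable\n",
    "    14. Var1 - categorical variable\n",
    "    15. Loan_Amount_Submitted - loan amount submitted (revised and selected after seeing eligibility)\n",
    "    16. Loan_Tenure_Submitted - loan tenure submitted (in years; revised and selected after seeing eligibility)\n",
    "    17. Interest_Rate - interest rate on the submitted loan amount\n",
    "    18. Processing_Fee - processing fee for the submitted loan (INR)\n",
    "    19. EMI_Loan_Submitted - EMI of the submitted loan (INR)\n",
    "    20. Filled_Form - whether the application form was filled in after the quote\n",
    "    21. Device_Type - device used to apply (web browser / mobile)\n",
    "    22. Var2 - categorical variable\n",
    "    23. Source - categorical variable\n",
    "    24. Var4 - categorical variable\n",
    "\n",
    "Outputs:\n",
    "    25. LoggedIn - whether the applicant logged in (only for understanding the problem; must not be used for prediction; absent from the test set)\n",
    "    26. Disbursed - whether the loan was disbursed (target variable); 1 means disbursed (target customer)\n"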
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, import the required modules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from xgboost import XGBClassifier\n",
    "import xgboost as xgb\n",
    "\n",
    "import pandas as pd \n",
    "import numpy as np\n",
    "\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.model_selection import StratifiedKFold\n",
    "\n",
    "from sklearn.metrics import log_loss\n",
    "\n",
    "from matplotlib import pyplot\n",
    "import seaborn as sns\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "dtrain = pd.read_csv(\"Train.csv\",encoding = 'ISO-8859-1')\n",
    "#dtrain = pd.read_csv(\"Train.csv\",encoding = 'utf-8')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "dtest = pd.read_csv(\"Test.csv\",encoding = 'ISO-8859-1')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('dtrain :', (87020, 26))\n"
     ]
    }
   ],
   "source": [
    "print(\"dtrain :\", dtrain.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('dtest :', (37717, 24))\n"
     ]
    }
   ],
   "source": [
    "print(\"dtest :\", dtest.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Add an extra column that marks whether each row comes from the training set or the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/ice2018/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py:4: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version\n",
      "of pandas will change to not sort by default.\n",
      "\n",
      "To accept the future behavior, pass 'sort=False'.\n",
      "\n",
      "To retain the current behavior and silence the warning, pass 'sort=True'.\n",
      "\n",
      "  after removing the cwd from sys.path.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>City</th>\n",
       "      <th>DOB</th>\n",
       "      <th>Data_source</th>\n",
       "      <th>Device_Type</th>\n",
       "      <th>Disbursed</th>\n",
       "      <th>EMI_Loan_Submitted</th>\n",
       "      <th>Employer_Name</th>\n",
       "      <th>Existing_EMI</th>\n",
       "      <th>Filled_Form</th>\n",
       "      <th>Gender</th>\n",
       "      <th>...</th>\n",
       "      <th>LoggedIn</th>\n",
       "      <th>Mobile_Verified</th>\n",
       "      <th>Monthly_Income</th>\n",
       "      <th>Processing_Fee</th>\n",
       "      <th>Salary_Account</th>\n",
       "      <th>Source</th>\n",
       "      <th>Var1</th>\n",
       "      <th>Var2</th>\n",
       "      <th>Var4</th>\n",
       "      <th>Var5</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Delhi</td>\n",
       "      <td>23-May-78</td>\n",
       "      <td>dtrain</td>\n",
       "      <td>Web-browser</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>CYBOSOL</td>\n",
       "      <td>0.0</td>\n",
       "      <td>N</td>\n",
       "      <td>Female</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>N</td>\n",
       "      <td>20000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>HDFC Bank</td>\n",
       "      <td>S122</td>\n",
       "      <td>HBXX</td>\n",
       "      <td>G</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Mumbai</td>\n",
       "      <td>07-Oct-85</td>\n",
       "      <td>dtrain</td>\n",
       "      <td>Web-browser</td>\n",
       "      <td>0.0</td>\n",
       "      <td>6762.9</td>\n",
       "      <td>TATA CONSULTANCY SERVICES LTD (TCS)</td>\n",
       "      <td>0.0</td>\n",
       "      <td>N</td>\n",
       "      <td>Male</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Y</td>\n",
       "      <td>35000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>ICICI Bank</td>\n",
       "      <td>S122</td>\n",
       "      <td>HBXA</td>\n",
       "      <td>G</td>\n",
       "      <td>3</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Panchkula</td>\n",
       "      <td>10-Oct-81</td>\n",
       "      <td>dtrain</td>\n",
       "      <td>Web-browser</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>ALCHEMIST HOSPITALS LTD</td>\n",
       "      <td>0.0</td>\n",
       "      <td>N</td>\n",
       "      <td>Male</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Y</td>\n",
       "      <td>22500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>State Bank of India</td>\n",
       "      <td>S143</td>\n",
       "      <td>HBXX</td>\n",
       "      <td>B</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Saharsa</td>\n",
       "      <td>30-Nov-87</td>\n",
       "      <td>dtrain</td>\n",
       "      <td>Web-browser</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>BIHAR GOVERNMENT</td>\n",
       "      <td>0.0</td>\n",
       "      <td>N</td>\n",
       "      <td>Male</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Y</td>\n",
       "      <td>35000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>State Bank of India</td>\n",
       "      <td>S143</td>\n",
       "      <td>HBXX</td>\n",
       "      <td>B</td>\n",
       "      <td>3</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Bengaluru</td>\n",
       "      <td>17-Feb-84</td>\n",
       "      <td>dtrain</td>\n",
       "      <td>Web-browser</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>GLOBAL EDGE SOFTWARE</td>\n",
       "      <td>25000.0</td>\n",
       "      <td>N</td>\n",
       "      <td>Male</td>\n",
       "      <td>...</td>\n",
       "      <td>1.0</td>\n",
       "      <td>Y</td>\n",
       "      <td>100000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>HDFC Bank</td>\n",
       "      <td>S134</td>\n",
       "      <td>HBXX</td>\n",
       "      <td>B</td>\n",
       "      <td>3</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 27 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        City        DOB Data_source  Device_Type  Disbursed  \\\n",
       "0      Delhi  23-May-78      dtrain  Web-browser        0.0   \n",
       "1     Mumbai  07-Oct-85      dtrain  Web-browser        0.0   \n",
       "2  Panchkula  10-Oct-81      dtrain  Web-browser        0.0   \n",
       "3    Saharsa  30-Nov-87      dtrain  Web-browser        0.0   \n",
       "4  Bengaluru  17-Feb-84      dtrain  Web-browser        0.0   \n",
       "\n",
       "   EMI_Loan_Submitted                        Employer_Name  Existing_EMI  \\\n",
       "0                 NaN                              CYBOSOL           0.0   \n",
       "1              6762.9  TATA CONSULTANCY SERVICES LTD (TCS)           0.0   \n",
       "2                 NaN              ALCHEMIST HOSPITALS LTD           0.0   \n",
       "3                 NaN                     BIHAR GOVERNMENT           0.0   \n",
       "4                 NaN                 GLOBAL EDGE SOFTWARE       25000.0   \n",
       "\n",
       "  Filled_Form  Gender  ...  LoggedIn  Mobile_Verified Monthly_Income  \\\n",
       "0           N  Female  ...       0.0                N          20000   \n",
       "1           N    Male  ...       0.0                Y          35000   \n",
       "2           N    Male  ...       0.0                Y          22500   \n",
       "3           N    Male  ...       0.0                Y          35000   \n",
       "4           N    Male  ...       1.0                Y         100000   \n",
       "\n",
       "   Processing_Fee       Salary_Account  Source  Var1  Var2 Var4  Var5  \n",
       "0             NaN            HDFC Bank    S122  HBXX     G    1     0  \n",
       "1             NaN           ICICI Bank    S122  HBXA     G    3    13  \n",
       "2             NaN  State Bank of India    S143  HBXX     B    1     0  \n",
       "3             NaN  State Bank of India    S143  HBXX     B    3    10  \n",
       "4             NaN            HDFC Bank    S134  HBXX     B    3    17  \n",
       "\n",
       "[5 rows x 27 columns]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Merge the training and test sets into one DataFrame\n",
    "dtrain['Data_source']= 'dtrain'\n",
    "dtest['Data_source'] = 'dtest'\n",
    "data_all=pd.concat([dtrain, dtest],ignore_index=True)\n",
    "data_all.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "#dtrain['Data of Train']=dtrain['ID'].apply(lambda x:1)\n",
    "#dtrain[['ID','Data of Train']].head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "#dtest['Data of Train']=dtest['ID'].apply(lambda x:0)\n",
    "#dtest[['ID','Data of Train']].head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(124737, 27)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_all.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 124737 entries, 0 to 124736\n",
      "Data columns (total 27 columns):\n",
      "City                     123336 non-null object\n",
      "DOB                      124737 non-null object\n",
      "Data_source              124737 non-null object\n",
      "Device_Type              124737 non-null object\n",
      "Disbursed                87020 non-null float64\n",
      "EMI_Loan_Submitted       39836 non-null float64\n",
      "Employer_Name            124624 non-null object\n",
      "Existing_EMI             124626 non-null float64\n",
      "Filled_Form              124737 non-null object\n",
      "Gender                   124737 non-null object\n",
      "ID                       124737 non-null object\n",
      "Interest_Rate            39836 non-null float64\n",
      "Lead_Creation_Date       124737 non-null object\n",
      "Loan_Amount_Applied      124626 non-null float64\n",
      "Loan_Amount_Submitted    75202 non-null float64\n",
      "Loan_Tenure_Applied      124626 non-null float64\n",
      "Loan_Tenure_Submitted    75202 non-null float64\n",
      "LoggedIn                 87020 non-null float64\n",
      "Mobile_Verified          124737 non-null object\n",
      "Monthly_Income           124737 non-null int64\n",
      "Processing_Fee           39391 non-null float64\n",
      "Salary_Account           107936 non-null object\n",
      "Source                   124737 non-null object\n",
      "Var1                     124737 non-null object\n",
      "Var2                     124737 non-null object\n",
      "Var4                     124737 non-null int64\n",
      "Var5                     124737 non-null int64\n",
      "dtypes: float64(10), int64(3), object(14)\n",
      "memory usage: 25.7+ MB\n"
     ]
    }
   ],
   "source": [
    "# Inspect column types and non-null counts\n",
    "data_all.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Several columns have missing values. Each one is handled differently, depending on how its values are missing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Check the number of categories in each categorical feature"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Frequency count for variable Gender\n",
      "Male      71398\n",
      "Female    53339\n",
      "Name: Gender, dtype: int64\n",
      "\n",
      "Frequency count for variable City\n",
      "Delhi                  17936\n",
      "Bengaluru              15522\n",
      "Mumbai                 15425\n",
      "Hyderabad              10410\n",
      "Chennai                 9895\n",
      "Pune                    7427\n",
      "Kolkata                 4282\n",
      "Ahmedabad               2528\n",
      "Jaipur                  1892\n",
      "Gurgaon                 1743\n",
      "Coimbatore              1659\n",
      "Thane                   1306\n",
      "Chandigarh              1266\n",
      "Surat                   1149\n",
      "Visakhapatnam           1080\n",
      "Indore                  1051\n",
      "Vadodara                 893\n",
      "Nagpur                   879\n",
      "Lucknow                  813\n",
      "Ghaziabad                795\n",
      "Bhopal                   735\n",
      "Kochi                    692\n",
      "Patna                    675\n",
      "Faridabad                651\n",
      "Noida                    549\n",
      "Madurai                  534\n",
      "Gautam Buddha Nagar      485\n",
      "Dehradun                 444\n",
      "Raipur                   430\n",
      "Bhubaneswar              407\n",
      "                       ...  \n",
      "Sawai Madhopur             1\n",
      "SOMNATH JUNAGADHA          1\n",
      "Chinnamiram                1\n",
      "Bandipore                  1\n",
      "CHIKHLI (GUJ.)             1\n",
      "Raisen                     1\n",
      "Dhalai                     1\n",
      "Rudraprayag                1\n",
      "Panna                      1\n",
      "UDWADA                     1\n",
      "Siruguppa                  1\n",
      "Kullu                      1\n",
      "Seoni                      1\n",
      "Umaria                     1\n",
      "Mainpuri                   1\n",
      "LUNAWADA                   1\n",
      "Champhai                   1\n",
      "Malkangiri                 1\n",
      "Hazaribagh                 1\n",
      "Tawang                     1\n",
      "Champawat                  1\n",
      "DHANDHUKA                  1\n",
      "Ramanagara                 1\n",
      "Magadh                     1\n",
      "Sheikhpura                 1\n",
      "Lohit                      1\n",
      "Bageshwar                  1\n",
      "Latehar                    1\n",
      "Pulwama                    1\n",
      "SAYAN                      1\n",
      "Name: City, Length: 723, dtype: int64\n",
      "\n",
      "Frequency count for variable Employer_Name\n",
      "0                                               6900\n",
      "TATA CONSULTANCY SERVICES LTD (TCS)              754\n",
      "COGNIZANT TECHNOLOGY SOLUTIONS INDIA PVT LTD     558\n",
      "ACCENTURE SERVICES PVT LTD                       476\n",
      "GOOGLE                                           408\n",
      "ICICI BANK LTD                                   337\n",
      "HCL TECHNOLOGIES LTD                             337\n",
      "IBM CORPORATION                                  265\n",
      "INDIAN AIR FORCE                                 258\n",
      "INFOSYS TECHNOLOGIES                             257\n",
      "INDIAN ARMY                                      243\n",
      "GENPACT                                          240\n",
      "WIPRO TECHNOLOGIES                               235\n",
      "TYPE SLOWLY FOR AUTO FILL                        219\n",
      "IKYA HUMAN CAPITAL SOLUTIONS LTD                 204\n",
      "ARMY                                             203\n",
      "INDIAN RAILWAY                                   201\n",
      "HDFC BANK LTD                                    201\n",
      "STATE GOVERNMENT                                 199\n",
      "WIPRO BPO                                        186\n",
      "INDIAN NAVY                                      183\n",
      "CONVERGYS INDIA SERVICES PVT LTD                 165\n",
      "OTHERS                                           159\n",
      "TECH MAHINDRA LTD                                158\n",
      "IBM GLOBAL SERVICES INDIA LTD                    158\n",
      "CONCENTRIX DAKSH SERVICES INDIA PVT LTD          154\n",
      "CAPGEMINI INDIA PVT LTD                          152\n",
      "SERCO BPO PVT LTD                                149\n",
      "SUTHERLAND GLOBAL SERVICES PVT LTD               141\n",
      "ADECCO INDIA PVT LTD                             140\n",
      "                                                ... \n",
      "VENKATESHARA RESEACH AND BREEDING                  1\n",
      "LAKHAN LAL SINHA                                   1\n",
      "HAJEE A P BAVA COM CONSTRUCT P LTD                 1\n",
      "SUPER HITER ENTERPRISE                             1\n",
      "PRATIK                                             1\n",
      "RAMANJANEYULU G                                    1\n",
      "MEGA BYTE CORPORATION                              1\n",
      "VEDANJAY POWER PRIVATE LIMITED                     1\n",
      "HOTEL PRIDE AMBER VILAS                            1\n",
      "ASC CENTRE AND COLLEGE                             1\n",
      "VEEPURI. SUJATHA                                   1\n",
      "RICO ALUMINIUM N FERROUS AUTO COM L                1\n",
      "ESS TEE EXPORT                                     1\n",
      "UHGIT                                              1\n",
      "AMBEST PRINTS PVT LTD                              1\n",
      "NNK                                                1\n",
      "SJR PRIME CORPORATION PVT LTD                      1\n",
      "BAJAJ ALLIANZ GIC                                  1\n",
      "SKY GROUP CONSULTING                               1\n",
      "A. A. NAYAK CONSTRUCTIONS PVT. LTD.                1\n",
      "RAJUU                                              1\n",
      "HARSHAD PANCHAL                                    1\n",
      "SHREE BALAJI CONSULTANCY                           1\n",
      "HILTON GARDEN INN                                  1\n",
      "VISHAL SHIPPING AGENCIES PVT LTD                   1\n",
      "CHHATTU SHIKARI                                    1\n",
      "MANZOOR ALI                                        1\n",
      "PARIKH AGENCY                                      1\n",
      "SRINIVAS PRADHAN                                   1\n",
      "AAKRITI FURNISHERS PVT LTD                         1\n",
      "Name: Employer_Name, Length: 57193, dtype: int64\n",
      "\n",
      "Frequency count for variable Salary_Account\n",
      "HDFC Bank                                          25180\n",
      "ICICI Bank                                         19547\n",
      "State Bank of India                                17110\n",
      "Axis Bank                                          12590\n",
      "Citibank                                            3398\n",
      "Kotak Bank                                          2955\n",
      "IDBI Bank                                           2213\n",
      "Punjab National Bank                                1747\n",
      "Bank of India                                       1713\n",
      "Bank of Baroda                                      1675\n",
      "Standard Chartered Bank                             1434\n",
      "Canara Bank                                         1385\n",
      "Union Bank of India                                 1330\n",
      "Yes Bank                                            1120\n",
      "ING Vysya                                            996\n",
      "Corporation bank                                     948\n",
      "Indian Overseas Bank                                 901\n",
      "State Bank of Hyderabad                              854\n",
      "Indian Bank                                          773\n",
      "Oriental Bank of Commerce                            761\n",
      "IndusInd Bank                                        711\n",
      "Andhra Bank                                          706\n",
      "Central Bank of India                                648\n",
      "Syndicate Bank                                       614\n",
      "Bank of Maharasthra                                  576\n",
      "HSBC                                                 474\n",
      "State Bank of Bikaner & Jaipur                       448\n",
      "Karur Vysya Bank                                     435\n",
      "State Bank of Mysore                                 385\n",
      "Federal Bank                                         377\n",
      "Vijaya Bank                                          354\n",
      "Allahabad Bank                                       345\n",
      "UCO Bank                                             344\n",
      "State Bank of Travancore                             333\n",
      "Karnataka Bank                                       279\n",
      "United Bank of India                                 276\n",
      "Dena Bank                                            268\n",
      "Saraswat Bank                                        265\n",
      "State Bank of Patiala                                263\n",
      "South Indian Bank                                    223\n",
      "Deutsche Bank                                        176\n",
      "Abhyuday Co-op Bank Ltd                              161\n",
      "The Ratnakar Bank Ltd                                113\n",
      "Tamil Nadu Mercantile Bank                           103\n",
      "Punjab & Sind bank                                    84\n",
      "J&K Bank                                              78\n",
      "Lakshmi Vilas bank                                    69\n",
      "Dhanalakshmi Bank Ltd                                 66\n",
      "State Bank of Indore                                  32\n",
      "Catholic Syrian Bank                                  27\n",
      "India Bulls                                           21\n",
      "B N P Paribas                                         15\n",
      "Firstrand Bank Limited                                11\n",
      "GIC Housing Finance Ltd                               10\n",
      "Bank of Rajasthan                                      8\n",
      "Kerala Gramin Bank                                     4\n",
      "Industrial And Commercial Bank Of China Limited        3\n",
      "Ahmedabad Mercantile Cooperative Bank                  1\n",
      "Name: Salary_Account, dtype: int64\n",
      "\n",
      "Frequency count for variable Mobile_Verified\n",
      "Y    80928\n",
      "N    43809\n",
      "Name: Mobile_Verified, dtype: int64\n",
      "\n",
      "Frequency count for variable Var1\n",
      "HBXX    84901\n",
      "HBXC    12952\n",
      "HBXB     6502\n",
      "HAXA     4214\n",
      "HBXA     3042\n",
      "HAXB     2879\n",
      "HBXD     2818\n",
      "HAXC     2171\n",
      "HBXH     1387\n",
      "HCXF      990\n",
      "HAYT      710\n",
      "HAVC      570\n",
      "HAXM      386\n",
      "HCXD      348\n",
      "HCYS      318\n",
      "HVYS      252\n",
      "HAZD      161\n",
      "HCXG      114\n",
      "HAXF       22\n",
      "Name: Var1, dtype: int64\n",
      "\n",
      "Frequency count for variable Filled_Form\n",
      "N    96740\n",
      "Y    27997\n",
      "Name: Filled_Form, dtype: int64\n",
      "\n",
      "Frequency count for variable Device_Type\n",
      "Web-browser    92105\n",
      "Mobile         32632\n",
      "Name: Device_Type, dtype: int64\n",
      "\n",
      "Frequency count for variable Var2\n",
      "B    53481\n",
      "G    47338\n",
      "C    20366\n",
      "E     1855\n",
      "D      918\n",
      "F      770\n",
      "A        9\n",
      "Name: Var2, dtype: int64\n",
      "\n",
      "Frequency count for variable Source\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "S122    55249\n",
      "S133    42900\n",
      "S159     7999\n",
      "S143     6140\n",
      "S127     2804\n",
      "S137     2450\n",
      "S134     1900\n",
      "S161     1109\n",
      "S151     1018\n",
      "S157      929\n",
      "S153      705\n",
      "S144      447\n",
      "S156      432\n",
      "S158      294\n",
      "S123      112\n",
      "S141       83\n",
      "S162       60\n",
      "S124       43\n",
      "S150       19\n",
      "S160       11\n",
      "S136        5\n",
      "S138        5\n",
      "S155        5\n",
      "S139        4\n",
      "S129        4\n",
      "S135        2\n",
      "S131        1\n",
      "S130        1\n",
      "S132        1\n",
      "S125        1\n",
      "S140        1\n",
      "S142        1\n",
      "S126        1\n",
      "S154        1\n",
      "Name: Source, dtype: int64\n",
      "\n",
      "Frequency count for variable Var4\n",
      "3    36280\n",
      "1    34316\n",
      "5    29092\n",
      "4     9411\n",
      "2     8481\n",
      "0     3564\n",
      "7     3264\n",
      "6      329\n",
      "Name: Var4, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "var = ['Gender','City','Employer_Name','Salary_Account','Mobile_Verified','Var1','Filled_Form','Device_Type','Var2','Source','Var4']\n",
    "for v in var:\n",
    "    print('\\nFrequency count for variable %s' % v)\n",
    "    print(data_all[v].value_counts())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The counts above show that some features have too many distinct categories, so they are candidates for removal."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Handling the City feature"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# City feature: as shown above, City has 723 distinct categories, so it is dropped here\n",
    "#len(data_all['City'].unique())\n",
    "data_all.drop('City',axis=1,inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Age at loan application (lead creation date minus date of birth)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>DOB</th>\n",
       "      <th>Lead_Creation_Date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>23-May-78</td>\n",
       "      <td>15-May-15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>07-Oct-85</td>\n",
       "      <td>04-May-15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>10-Oct-81</td>\n",
       "      <td>19-May-15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>30-Nov-87</td>\n",
       "      <td>09-May-15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>17-Feb-84</td>\n",
       "      <td>20-May-15</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         DOB Lead_Creation_Date\n",
       "0  23-May-78          15-May-15\n",
       "1  07-Oct-85          04-May-15\n",
       "2  10-Oct-81          19-May-15\n",
       "3  30-Nov-87          09-May-15\n",
       "4  17-Feb-84          20-May-15"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_all[['DOB','Lead_Creation_Date']].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    37\n",
       "1    30\n",
       "2    34\n",
       "3    28\n",
       "4    31\n",
       "5    33\n",
       "6    28\n",
       "7    40\n",
       "8    43\n",
       "9    26\n",
       "Name: Age, dtype: int64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
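   ],
   "source": [
    "# Create the Age feature from the two-digit years: (100 - birth year) + lead-creation year.\n",
    "# This assumes every birth year is 19xx and every lead-creation year is 20xx.\n",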
    "data_all['Age'] = data_all['DOB'].apply(lambda x: 100 - int(x[-2:]))\n",
    "data_all['Age']=data_all['Age']+data_all['Lead_Creation_Date'].apply(lambda x: int(x[-2:]))\n",
    "data_all['Age'].head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "#删除DOB和Lead_Creation_Date两个特征\n",
    "data_all.drop('DOB',axis=1,inplace=True)\n",
    "data_all.drop('Lead_Creation_Date',axis=1,inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### EMI_Loan_Submitted特征处理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x7f06940b64d0>"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAY0AAAD9CAYAAABA8iukAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAG55JREFUeJzt3X901fWd5/HnSyKEShVtaw4SzuK0nG6Q2ZmpGXRGtycpM4DWLc6cdivtjLRNy8Eqbdd1FMmepdM2MzKdU6e2AosNC3bbYMdpVzq1IovJdrKjVNRW0NQStSMpjNYFWbEa+fHeP+4n9hIuyZd7ubm58fU45558v+/v5/v9fsJJ8uLz+X7v9yoiMDMzy+K0SnfAzMyqh0PDzMwyc2iYmVlmDg0zM8vMoWFmZpk5NMzMLDOHhpmZZebQMDOzzBwaZmaWWU2lO3Cqvf3tb4/p06dXuhtmx3nllVc444wzKt0Ns4IeeeSRFyPiHcO1G3OhMX36dLZv317pbpgdp6uri6ampkp3w6wgSf+SpZ2np8zMLDOHhpmZZebQMDOzzBwaZmaWmUPDzMwyc2iYlVlHRwezZs1izpw5zJo1i46Ojkp3yaxoY+6WW7PRpKOjg9bWVtrb2zly5Ajjxo2jpaUFgIULF1a4d2YnzyMNszJqa2ujvb2d5uZmampqaG5upr29nba2tkp3zawoDg2zMurp6eHSSy89pnbppZfS09NToR6ZlcahYVZGDQ0NdHd3H1Pr7u6moaGhQj0yK41Dw6yMWltbaWlpobOzk8OHD9PZ2UlLSwutra2V7ppZUYa9EC5pHXAF8EJEzBq07Qbgy8A7IuJFSQK+ClwO/Br4WEQ8mtouAv5L2vVLEbEh1S8E1gMTgXuBz0ZESDoHuAuYDvwC+I8Rsb+k79ZshA1c7F66dCk9PT00NDTQ1tbmi+BWtbKMNNYD8wcXJU0D/hh4Lq98GTAjvRYDq1Pbc4AVwEXAbGCFpLPTPqtT24H9Bs61DNgaETOArWndrOosXLiQnTt3snXrVnbu3OnAsKo2bGhExI+AfQU23QrcCERebQFwZ+Q8BEyWNAWYB2yJiH1ptLAFmJ+2nRkRD0ZEAHcCV+Yda0Na3pBXNzOzCinqmoakDwC/jIifDto0Fdidt96XakPV+wrUAeoiYi9A+npuMX01M7NT56Tf3CfpLUArMLfQ5gK1KKJ+sn1aTG6Ki7q6Orq6uk72EGZld/DgQf9sWtUr5h3h7wTOB36au+5NPfCopNnkRgrT8trWA3tSvWlQvSvV6wu0B3he0pSI2JumsV44UYciYi2wFqCxsTH8QTc2GvlDmGwsOOnpqYjYERHnRsT0iJhO7g//eyLiX4FNwNXKuRg4kKaWNgNzJZ2dLoDPBTanbS9LujjdeXU1cE861SZgUVpelFc3M7MKGTY0JHUADwLvltQnqWWI5vcCzwC9wB3ApwEiYh/wReDh9PpCqgFcA3wj7fM08MNUvwX4Y0m7yN2ldcvJfWtmo8PSpUupra2lubmZ2tpali5dWukumRVt2OmpiBjy/sA02hhYDuDaE7RbB6wrUN8OzCpQ/7/AnOH6ZzaaLV26lDVr1rBy5UpmzpzJk08+yU033QTA1772tQr3zuzk+R3hZmV0xx13sHLlSq6//npqa2u5/vrrWblyJXfccUelu2ZWFIeGWRn19/ezZMmSY2pLliyhv7+/Qj0yK41Dw6yMJkyYwJo1a46prVmzhgkTJlSoR2al8YcwmZXRpz71qTeuYcycOZOvfOUr3HTTTceNPsyqhUPDrIwGLnYvX76c/v5+JkyYwJIlS3wR3KqWcjc8jR2NjY2xffv2SnfD7Dh+c5+NZpIeiYjG4dr5moaZmWXm0DAzs8wcGmZmlplDw8zMMnNomJlZZg4NMzPLzKFhZmaZOTTMzCwzh4aZmWXm0DAzs8wcGmZmlplDw8zMMnNomJlZ
ZsOGhqR1kl6QtDOv9mVJP5P0uKTvSZqct+1mSb2SnpI0L68+P9V6JS3Lq58vaZukXZLukjQ+1Sek9d60ffqp+qbNzKw4WUYa64H5g2pbgFkR8e+AnwM3A0iaCVwFXJD2WSVpnKRxwO3AZcBMYGFqC7ASuDUiZgD7gZZUbwH2R8S7gFtTOzMzq6BhQyMifgTsG1S7PyIOp9WHgPq0vADYGBH9EfEs0AvMTq/eiHgmIl4HNgILJAl4H3B32n8DcGXesTak5buBOam9mZlVyKn45L5PAHel5ankQmRAX6oB7B5Uvwh4G/BSXgDlt586sE9EHJZ0ILV/cXAHJC0GFgPU1dXR1dVV2ndkVgYHDx70z6ZVvZJCQ1IrcBj41kCpQLOg8Igmhmg/1LGOL0asBdZC7pP7/OloNhr5k/tsLCg6NCQtAq4A5sRvPjO2D5iW16we2JOWC9VfBCZLqkmjjfz2A8fqk1QDnMWgaTIzMxtZRd1yK2k+cBPwgYj4dd6mTcBV6c6n84EZwI+Bh4EZ6U6p8eQulm9KYdMJfDDtvwi4J+9Yi9LyB4EHYqx9oLmZWZUZdqQhqQNoAt4uqQ9YQe5uqQnAlnRt+qGIWBIRT0j6DvAkuWmrayPiSDrOdcBmYBywLiKeSKe4Cdgo6UvAY0B7qrcD35TUS26EcdUp+H7NzKwEw4ZGRCwsUG4vUBto3wa0FajfC9xboP4MuburBtdfAz40XP/MzGzk+B3hZmaWmUPDzMwyc2iYmVlmDg0zM8vMoWFmZpk5NMzMLDOHhpmZZebQMDOzzBwaZmaWmUPDzMwyc2iYmVlmDg0zM8vMoWFmZpk5NMzMLDOHhpmZZebQMDOzzBwaZmaWmUPDzMwyGzY0JK2T9IKknXm1cyRtkbQrfT071SXpNkm9kh6X9J68fRal9rskLcqrXyhpR9rnNqUPHT/ROczMrHKyjDTWA/MH1ZYBWyNiBrA1rQNcBsxIr8XAasgFALACuIjc54GvyAuB1antwH7zhzmHmZlVyLChERE/AvYNKi8ANqTlDcCVefU7I+chYLKkKcA8YEtE7IuI/cAWYH7admZEPBgRAdw56FiFzmFmZhVS7DWNuojYC5C+npvqU4Hdee36Um2oel+B+lDnMDOzCqk5xcdTgVoUUT+5k0qLyU1xUVdXR1dX18kewqzsDh486J9Nq3rFhsbzkqZExN40xfRCqvcB0/La1QN7Ur1pUL0r1esLtB/qHMeJiLXAWoDGxsZoamo6UVOzEdfR0UFbWxs9PT00NDTQ2trKwoULK90ts6IUGxqbgEXALenrPXn16yRtJHfR+0D6o78Z+Ku8i99zgZsjYp+klyVdDGwDrga+Nsw5zKpGR0cHra2ttLe3c+TIEcaNG0dLSwuAg8OqU0QM+QI6gL3AIXIjgxbgbeTuaNqVvp6T2gq4HXga2AE05h3nE0Bven08r94I7Ez7fB1Qqhc8x3CvCy+8MMxGiwsuuCAeeOCBiIjo7OyMiIgHHnggLrjgggr2yux4wPbI8Dd24A/0mNHY2Bjbt2+vdDfMABg3bhyvvfYap59+Ol1dXTQ1NXHo0CFqa2s5cuRIpbtn9gZJj0RE43Dt/I5wszJqaGigu7v7mFp3dzcNDQ0V6pFZaRwaZmXU2tpKS0sLnZ2dHD58mM7OTlpaWmhtba1018yKcqpvuTWzPAMXu5cuXfrG3VNtbW2+CG5Vy9c0zEbIwDUNs9HI1zTMzOyUc2iYmVlmDg2zMuvo6GDWrFnMmTOHWbNm0dHRUekumRXNF8LNysjvCLexxiMNszJqa2ujvb2d5uZmampqaG5upr29nba2tkp3zawoDg2zMurp6aGvr++Y6am+vj56enoq3TWzonh6yqyMzjvvPG688Ua+/e1vvzE99ZGPfITzzjuv0l0zK4pHGmZllj72/oTrZtXEIw2zMtqzZw/r168/5h3hK1eu5GMf+1ilu2ZWFI80zMqooaGB+vp6du7cydatW9m5cyf1
9fV+YKFVLYeGWRn5gYU21nh6yqyM/MBCG2v8wEKzEeIHFtpo5gcWmpnZKefQMDOzzEoKDUn/SdITknZK6pBUK+l8Sdsk7ZJ0l6Txqe2EtN6btk/PO87Nqf6UpHl59fmp1itpWSl9NTOz0hUdGpKmAp8BGiNiFjAOuApYCdwaETOA/UBL2qUF2B8R7wJuTe2QNDPtdwEwH1glaZykccDtwGXATGBhamtWVfyUWxtLSr17qgaYKOkQ8BZgL/A+4CNp+wbg88BqYEFaBrgb+Lpyb41dAGyMiH7gWUm9wOzUrjcingGQtDG1fbLEPpuNGD/l1saaokMjIn4p6W+B54BXgfuBR4CXIuJwatYHTE3LU4Hdad/Dkg4Ab0v1h/IOnb/P7kH1iwr1RdJiYDFAXV0dXV1dxX5bZqfU8uXL+cxnPoMkXnvtNSZNmsTSpUtZvnw5U6ZMqXT3zE5a0aEh6Wxy//M/H3gJ+HtyU0mDDdzTW+iBOzFEvdDUWcH7gyNiLbAWcrfc+rZGGy2ee+45rrvuOk4//fQ3brm95JJLuOGGG3z7rVWlUi6E/xHwbET8KiIOAd8F/hCYLGkgjOqBPWm5D5gGkLafBezLrw/a50R1s6rR0NBAd3f3MbXu7m4/RsSqVimh8RxwsaS3pGsTc8hdb+gEPpjaLALuScub0jpp+wORe2fhJuCqdHfV+cAM4MfAw8CMdDfWeHIXyzeV0F+zEefHiNhYU8o1jW2S7gYeBQ4Dj5GbIvoBsFHSl1KtPe3SDnwzXejeRy4EiIgnJH2HXOAcBq6NiCMAkq4DNpO7M2tdRDxRbH/NKsGPEbGxxo8RMRshfoyIjWZ+jIiZmZ1yDg0zM8vMoWFmZpk5NMzMLDOHhpmZZebQMDOzzBwaZmU2b948TjvtNJqbmznttNOYN2/e8DuZjVIODbMymjdvHvfffz9Llizh+9//PkuWLOH+++93cFjVKvXR6GY2hC1btnDNNdewatUqurq6WLVqFQBr1qypcM/MiuN3hJuVkSTOOussDhw48EZtYH2s/e5ZdfM7ws1GifzAKLRuVk0cGmYjYNKkSaxevZpJkyZVuitmJfE1DbMyGz9+PK+88grXXHMNkhg/fjyvv/56pbtlVhSPNMzKbNq0aRw9epTOzk6OHj3KtGnTht/JbJTySMOszJ5++mlyn1NmVv080jAro5qawv8vO1HdbLRzaJiV0ZEjR06qbjbaOTTMyigi+OQnP0lE0NnZecy6WTUqKTQkTZZ0t6SfSeqR9AeSzpG0RdKu9PXs1FaSbpPUK+lxSe/JO86i1H6XpEV59Qsl7Uj73CZPDFsV2rFjB7W1tTQ3N1NbW8uOHTsq3SWzopU60vgqcF9E/Fvgd4AeYBmwNSJmAFvTOsBlwIz0WgysBpB0DrACuAiYDawYCJrUZnHefvNL7K/ZiNu2bRv9/f0A9Pf3s23btgr3yKx4RYeGpDOB9wLtABHxekS8BCwANqRmG4Ar0/IC4M7IeQiYLGkKMA/YEhH7ImI/sAWYn7adGREPRm4sf2fesczMrAJKGWn8FvAr4L9LekzSNySdAdRFxF6A9PXc1H4qsDtv/75UG6reV6BuZmYVUsp9fzXAe4ClEbFN0lf5zVRUIYWuR0QR9eMPLC0mN41FXV0dXV1dQ3TDbORNnDiRV1999Y2vgH9OrSqVEhp9QF9EDEzQ3k0uNJ6XNCUi9qYpphfy2ue/FbYe2JPqTYPqXaleX6D9cSJiLbAWck+5bWpqKtTMrGLyr2kM8M+pVaOip6ci4l+B3ZLenUpzgCeBTcDAHVCLgHvS8ibg6nQX1cXAgTR9tRmYK+nsdAF8LrA5bXtZ0sXprqmr845lVlWOHj16zFezalXq21KXAt+SNB54Bvg4uSD6jqQW4DngQ6ntvcDlQC/w69SWiNgn6YvAw6ndFyJiX1q+BlgPTAR+mF5mZlYh/hAmszIa6q1FY+13z6qbP4TJzMxOOYeG2QgYGHH4oQZW7Rwa
ZiNgYCrKU1JW7RwaZmaWmUPDzMwyc2iYmVlmDg0zM8vMoWFmZpk5NMzMLDOHhpmZZebQMDOzzBwaZmaWmUPDzMwyc2iYmVlmDg0zM8vMoWFmZpk5NMzMLDOHhpmZZebQMDOzzEoODUnjJD0m6R/T+vmStknaJekuSeNTfUJa703bp+cd4+ZUf0rSvLz6/FTrlbSs1L6amVlpTsVI47NAT976SuDWiJgB7AdaUr0F2B8R7wJuTe2QNBO4CrgAmA+sSkE0DrgduAyYCSxMbc3MrEJKCg1J9cD7gW+kdQHvA+5OTTYAV6blBWmdtH1Oar8A2BgR/RHxLNALzE6v3oh4JiJeBzamtmZmViGljjT+DrgROJrW3wa8FBGH03ofMDUtTwV2A6TtB1L7N+qD9jlR3czMKqSm2B0lXQG8EBGPSGoaKBdoGsNsO1G9UKBFgRqSFgOLAerq6ujq6jpxx81GCf+cWjUqOjSAS4APSLocqAXOJDfymCypJo0m6oE9qX0fMA3ok1QDnAXsy6sPyN/nRPVjRMRaYC1AY2NjNDU1lfBtmY0M/5xaNSp6eioibo6I+oiYTu5C9gMR8VGgE/hgarYIuCctb0rrpO0PRESk+lXp7qrzgRnAj4GHgRnpbqzx6Rybiu2vmZmVrpSRxoncBGyU9CXgMaA91duBb0rqJTfCuAogIp6Q9B3gSeAwcG1EHAGQdB2wGRgHrIuIJ8rQXzMzy0i5/+yPHY2NjbF9+/ZKd8MMgNwNgoWNtd89q26SHomIxuHa+R3hZmaWmUPDzMwyc2iYmVlmDg0zM8vMoWFmZpk5NMzMLDOHhpmZZebQMDOzzBwaZmaWmUPDzMwyc2iYmVlmDg0zM8vMoWFmZpk5NMzMLDOHhpmZZebQMDOzzBwaZmaWmUPDzMwyKzo0JE2T1CmpR9ITkj6b6udI2iJpV/p6dqpL0m2SeiU9Luk9ecdalNrvkrQor36hpB1pn9s01Gdnmo0wScO+St3fP/I22pQy0jgM/OeIaAAuBq6VNBNYBmyNiBnA1rQOcBkwI70WA6shFzLACuAiYDawYiBoUpvFefvNL6G/ZqdURAz7KnV/f464jTZFh0ZE7I2IR9Pyy0APMBVYAGxIzTYAV6blBcCdkfMQMFnSFGAesCUi9kXEfmALMD9tOzMiHozcb86deccyM7MKOCXXNCRNB34P2AbURcReyAULcG5qNhXYnbdbX6oNVe8rUDerGicaKXgEYdWqptQDSJoE/APwuYj4f0PMwRbaEEXUC/VhMblpLOrq6ujq6hqm12Yjp7OzE4CP3fcK6+efAeCfUataJYWGpNPJBca3IuK7qfy8pCkRsTdNMb2Q6n3AtLzd64E9qd40qN6V6vUF2h8nItYCawEaGxujqampUDOzyrrvB/hn06pdKXdPCWgHeiLiK3mbNgEDd0AtAu7Jq1+d7qK6GDiQpq82A3MlnZ0ugM8FNqdtL0u6OJ3r6rxjmZlZBZQy0rgE+HNgh6SfpNpy4BbgO5JagOeAD6Vt9wKXA73Ar4GPA0TEPklfBB5O7b4QEfvS8jXAemAi8MP0MjOzCik6NCKim8LXHQDmFGgfwLUnONY6YF2B+nZgVrF9NDOzU8vvCDczs8wcGmZmlplDw8zMMnNomJlZZg4NMzPLrOR3hJuNBb/zl/dz4NVDZT/P9GU/KOvxz5p4Oj9dMbes57A3N4eGGXDg1UP84pb3l/UcXV1dZX9HeLlDyczTU2ZmlplDw8zMMnNomJlZZr6mYQa8tWEZv71h2fANS7Vh+CaleGsDQHmvzdibm0PDDHi55xZfCDfLwNNTZmaWmUPDzMwy8/SUWTIiUzv3lf/NfWbl5NAwg7Jfz4BcKI3EeczKydNTZmaWmUPDzMwyc2iYmVlmoz40JM2X9JSkXkkj8O4rMzM7kVEdGpLGAbcDlwEzgYWSZla2V2Zmb16j/e6p2UBvRDwDIGkjsAB4sqK9MgMknfw+K0/+PBFx8juZlcmo
HmkAU4Hdeet9qWZWcRFxUq/Ozs6T3seBYaPNaB9pFPqv3HG/RZIWA4sB6urq6OrqKnO3zE7ewYMH/bNpVW+0h0YfMC1vvR7YM7hRRKwF1gI0NjZGuR8KZ1aMkXhgoVm5jfbpqYeBGZLOlzQeuArYVOE+mZm9aY3qkUZEHJZ0HbAZGAesi4gnKtwtM7M3rVEdGgARcS9wb6X7YWZmo396yszMRhGHhpmZZebQMDOzzDTW3jwk6VfAv1S6H2YFvB14sdKdMDuBfxMR7xiu0ZgLDbPRStL2iGisdD/MSuHpKTMzy8yhYWZmmTk0zEbO2kp3wKxUvqZhZmaZeaRhZmaZOTTMzCwzh4aNepKOSPpJ3mtZqndJek55H6En6X9KOpiWp0vaOcRxmyT9Y/m/g2PO+RZJ35K0Q9JOSd2SJg2zT5ekom/VlfSNgY9JlrQ8rz5Z0qeLON7nJd1QbH+suo36BxaaAa9GxO+eYNtLwCVAt6TJwJSR61ZRPgs8HxG/DSDp3cChcp4wIj6Zt7oc+Ku0PBn4NLCqnOe3scUjDat2G8l9zgrAnwLfLfWAkuZIeiyNBtZJmpDq/1XSw2mEsHZghJNGAisl/VjSzyX9+yEOPwX45cBKRDwVEf2DR0WSbpD0+bz9/kzSP6dzz05tPi9pg6T7Jf1C0p9K+pvU7/sknZ7Xv0ZJtwAT02jtW8AtwDvT+pdT279I3+Pjkv4yrz+tkp6S9L+Ad5f2L2zVzKFh1WDgD93A68N527YC75U0jlx43FXKiSTVAuuBD6fRQA1wTdr89Yj4/YiYBUwErsjbtSYiZgOfA1YMcYp1wE2SHpT0JUkzMnbtjIj4Q3Ijg3V59XcC7wcWAP8D6Ez9fjXV3xARy0ijtoj4KLAMeDqt/4WkucAMYDbwu8CFkt4r6UJy/7a/Ry6Yfz9jn20M8vSUVYOhpqeOAN3Ah4GJEfGLvEscxXg38GxE/DytbwCuBf4OaJZ0I/AW4BzgCeD7qd3ACOcRYPqJDh4RP5H0W8Bc4I+AhyX9Abk/8kPpSPv/SNKZaSoO4IcRcUjSDnIfVHZfqu8Yqh8nMDe9Hkvrk8iFyFuB70XErwEk+dMz38QcGjYWbAS+B3z+FByrYOKkEcgqoDEidqepo9q8Jv3p6xGG+b2KiIPkQua7ko4Cl5MbIeWP/GsH73aC9f50zKOSDsVv3nh1dLh+FCDgryPivx1TlD5X4Pz2JuXpKRsL/gn4a9L/xkv0M2C6pHel9T8H/je/+SP+Yrrb6YPFHFzSJZLOTsvjgZnknsr8PHCupLelayhXDNr1w2mfS4EDEXGgmPMDhwaudQAvkxtFDNgMfGLgbi5JUyWdC/wI+BNJEyW9FfgPRZ7bxgCPNKwaTJT0k7z1+9L8PADpf9d/W+Sx50jqy1v/EPBx4O8l1QAPA2vSxeo7yE37/CLVi/FOYHW6iH4a8APgHyIiJH0B2AY8Sy688u2X9M/AmcAnijw35B5l8rikRyPio5L+T7oA/8N0XaMBeDBN8R0E/iwiHpV0F/ATcgH3TyWc36qcHyNiZmaZeXrKzMwy8/SUjXmS5gErB5WfjYg/GUvnNBsJnp4yM7PMPD1lZmaZOTTMzCwzh4aZmWXm0DAzs8wcGmZmltn/B9JRy3+cNRJ7AAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "data_all.boxplot(column=['EMI_Loan_Submitted'],return_type='axes')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EMI_Loan_Submitted</th>\n",
       "      <th>EMI_Loan_Submitted_Missing</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>6762.90</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>6978.92</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>30824.65</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>10883.38</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   EMI_Loan_Submitted  EMI_Loan_Submitted_Missing\n",
       "0                 NaN                           1\n",
       "1             6762.90                           0\n",
       "2                 NaN                           1\n",
       "3                 NaN                           1\n",
       "4                 NaN                           1\n",
       "5             6978.92                           0\n",
       "6                 NaN                           1\n",
       "7                 NaN                           1\n",
       "8            30824.65                           0\n",
       "9            10883.38                           0"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#由于该特征丢失数据较多，这里新建一个特征用来表示该特征是否丢失\n",
    "data_all['EMI_Loan_Submitted_Missing'] = data_all['EMI_Loan_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)\n",
    "data_all[['EMI_Loan_Submitted','EMI_Loan_Submitted_Missing']].head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "#丢弃EMI_Loan_Submitted特征变量\n",
    "data_all.drop('EMI_Loan_Submitted',axis=1,inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Employer_Name特征处理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Name: Employer_Name, Length: 57193, dtype: int64 该特征可以丢弃\n",
    "data_all.drop('Employer_Name',axis=1,inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Salary_Account 特征处理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_all.drop('Salary_Account',axis=1,inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Existing EMI特征处理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x7f067c7e1f90>"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAEECAYAAADTdnSRAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAESJJREFUeJzt3X+MZfdZ3/H3J7NxCE4cfqw7ImuTtWAbxp4QQqc2lAVmukHYUNmqIGRHrUToNKugeilNi+RokEFuR1UIVVSIabvRhAQQ15j8Aat0E4NgrhK3SWRbOE7skenKSeqtrWbBiZNxAs4OD3/MtRlfz+7cmb2zd+bL+yWNfH4895xH1vFHX3/vPeekqpAkteUlo25AkjR8hrskNchwl6QGGe6S1CDDXZIaZLhLUoNGGu5J3pfkC0k+M0Dtu5M82Pv78yRfuhQ9StJelFH+zj3JDwErwG9V1eQWPncceENV/asda06S9rCRjtyr6qPAU+u3JfmOJB9J8kCSjyX5rg0+Ogt0LkmTkrQH7Rt1Axs4Abytqv5PkhuA3wD+6XM7k7wGuAb40xH1J0m73q4K9ySvAP4J8PtJntv8sr6yo8AHq2r1UvYmSXvJrgp31qaJvlRV33OBmqPAv7lE/UjSnrSrfgpZVV8GPpvkTQBZ8/rn9id5LfDNwMdH1KIk7Qmj/ilkh7Wgfm2SM0nmgH8BzCX5FPAwcMu6j8wCd5WPspSkCxrpTyElSTtjV03LSJKGY2RfqO7fv78OHjw4qtNL5/XMM89w+eWXj7oNaUMPPPDAX1TVlZvVjSzcDx48yP333z+q00vn1e12mZ6eHnUb0oaSfH6QOqdlJKlBhrskNchwl6QGGe6S1CDDXZIatGm4b/ZCjd4jAn4tyekkDyX53uG3Ke28TqfD5OQkR44cYXJykk7Hp0pr7xrkp5DvB94D/NZ59t8EHOr93QD8t94/pT2j0+kwPz/P4uIiq6urjI2NMTc3B8Ds7OyIu5O2btOR+0Yv1OhzC2tvUqqq+gTwTUm+bVgNSpfCwsICi4uLzMzMsG/fPmZmZlhcXGRhYWHUrUnbMoybmA4Aj69bP9Pb9mR/YZJjwDGA8fFxut3uEE4vXbzl5WVWV1fpdrusrKzQ7XZZXV1leXnZ61R70jDCPRts2/BpZFV1grU3LTE1NVXeBajdYmJigrGxMaanp5+/Q3VpaYmJiQnvVtWeNIxfy5wBrl63fhXwxBCOK10y8/PzzM3NsbS0xLlz51haWmJubo75+flRtyZtyzBG7ieBW5PcxdoXqU9X1YumZKTd7LkvTY8fP87y8jITExMsLCz4Zar2rE2f5957ocY0sB/4/8AvAS8FqKr/nrWXnb4HuBH4KvAzVbXpE8GmpqbKB4dpN/LBYdrNkjxQVVOb1W06cq+qCw5dem9F8p2mkrSLeIeqJDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNGijck9yY5NEkp5PctsH+b0+ylOTPkjyU5MeG36okaVCbhnuSMeBO4CbgWmA2ybV9Zb8I3F1VbwCOAr8x7EYlSYMbZOR+PXC6qh6rqmeBu4Bb+moKuKK3/CrgieG1KEnaqn0D1BwAHl+3fga4oa/ml4E/SnIcuBx440YHSnIMOAYwPj5Ot9vdYrvSzltZWfHa1J43SLhng23Vtz4LvL+q/kuS7wd+O8lkVf3NCz5UdQI4ATA1NVXT09PbaFnaWd1uF69N7XWDTMucAa5et34VL552mQPuBqiqjwPfAOwfRoOSpK0bJNzvAw4luSbJZax9YXqyr+b/AkcAkkywFu5n
h9moJGlwm4Z7VZ0DbgXuAZZZ+1XMw0nuSHJzr+zfA29N8imgA7ylqvqnbiRJl8ggc+5U1SngVN+229ctPwL8wHBbkyRtl3eoSlKDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYNFO5JbkzyaJLTSW47T81PJXkkycNJfne4bUqStmLfZgVJxoA7gR8BzgD3JTlZVY+sqzkEvAP4gar6YpJ/sFMNS5I2N8jI/XrgdFU9VlXPAncBt/TVvBW4s6q+CFBVXxhum5Kkrdh05A4cAB5ft34GuKGv5h8CJPlfwBjwy1X1kf4DJTkGHAMYHx+n2+1uo2VpZ62srHhtas8bJNyzwbba4DiHgGngKuBjSSar6ksv+FDVCeAEwNTUVE1PT2+1X2nHdbtdvDa11w0yLXMGuHrd+lXAExvU/GFVfb2qPgs8ylrYS5JGYJBwvw84lOSaJJcBR4GTfTV/AMwAJNnP2jTNY8NsVJI0uE3DvarOAbcC9wDLwN1V9XCSO5Lc3Cu7B/jLJI8AS8AvVNVf7lTTkqQLG2TOnao6BZzq23b7uuUC3t77kySNmHeoSlKDDHdJapDhLvV0Oh0mJyc5cuQIk5OTdDqdUbckbdtAc+5S6zqdDvPz8ywuLrK6usrY2Bhzc3MAzM7Ojrg7aescuUvAwsICi4uLzMzMsG/fPmZmZlhcXGRhYWHUrUnbYrhLwPLyMocPH37BtsOHD7O8vDyijqSLY7hLwMTEBPfee+8Ltt17771MTEyMqCPp4hjuEjA/P8/c3BxLS0ucO3eOpaUl5ubmmJ+fH3Vr0rb4harE331pevz4cZaXl5mYmGBhYcEvU7VnZe3m0ktvamqq7r///pGcW7oQnwqp3SzJA1U1tVmd0zKS1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDRoo3JPcmOTRJKeT3HaBup9MUkk2fTO3JGnnbBruScaAO4GbgGuB2STXblD3SuDngE8Ou0lJ0tYMMnK/HjhdVY9V1bPAXcAtG9T9R+BXgL8aYn+SpG3YN0DNAeDxdetngBvWFyR5A3B1VX0oyX8434GSHAOOAYyPj9PtdrfcsLTTVlZWvDa15w0S7tlgWz2/M3kJ8G7gLZsdqKpOACcApqamanp6eqAmpUup2+3itam9bpBpmTPA1evWrwKeWLf+SmAS6Cb5HPB9wEm/VJWk0Rkk3O8DDiW5JsllwFHg5HM7q+rpqtpfVQer6iDwCeDmqrp/RzqWJG1q03CvqnPArcA9wDJwd1U9nOSOJDfvdIOSpK0bZM6dqjoFnOrbdvt5aqcvvi1J0sXwDlVJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUoIHCPcmNSR5NcjrJbRvsf3uSR5I8lORPkrxm+K1Kkga1abgnGQPuBG4CrgVmk1zbV/ZnwFRVfTfwQeBXht2oJGlwg4zcrwdOV9VjVfUscBdwy/qCqlqqqq/2Vj8BXDXcNiVJW7FvgJoDwOPr1s8AN1ygfg748EY7khwDjgGMj4/T7XYH61K6hFZWVrw2tecNEu7ZYFttWJj8S2AK+OGN9lfVCeAEwNTUVE1PTw/WpXQJdbtdvDa11w0S7meAq9etXwU80V+U5I3A
PPDDVfXXw2lPkrQdg8y53wccSnJNksuAo8DJ9QVJ3gD8D+DmqvrC8NuUJG3FpuFeVeeAW4F7gGXg7qp6OMkdSW7ulb0LeAXw+0keTHLyPIeTJF0Cg0zLUFWngFN9225ft/zGIfclSboI3qEqSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl3o6nQ6Tk5McOXKEyclJOp3OqFuStm2gF2RLret0OszPz7O4uMjq6ipjY2PMzc0BMDs7O+LupK1z5C4BCwsLLC4uMjMzw759+5iZmWFxcZGFhYVRtyZti+EuAcvLyxw+fPgF2w4fPszy8vKIOpIujtMyEjAxMcGBAwc4e/bs89uuvPJKJiYmRtiVtH2O3CXgySef5OzZs1x33XV0Oh2uu+46zp49y5NPPjnq1qRtceQuAU899RRXXHEFjzzyCLOzsyThiiuu4Kmnnhp1a9K2GO5Sz5e//OXnl6vqBevSXuO0jCQ1yHCXpAYZ7pLUIMNdkho0ULgnuTHJo0lOJ7ltg/0vS/J7vf2fTHJw2I1Kkga3abgnGQPuBG4CrgVmk1zbVzYHfLGqvhN4N/DOYTcqSRrcICP364HTVfVYVT0L3AXc0ldzC/CB3vIHgSNJMrw2JUlbMcjv3A8Aj69bPwPccL6aqjqX5GngW4G/WF+U5BhwDGB8fJxut7u9rvX31vHPH9+R406+f/K8+173gdftyDkBfv01v75jx9bfb4OE+0Yj8NpGDVV1AjgBMDU1VdPT0wOcXvo7n+bTO3LcC/2PZtWLLmVp1xtkWuYMcPW69auAJ85Xk2Qf8CrA+7YlaUQGCff7gENJrklyGXAUONlXcxL46d7yTwJ/Wg53tIec73L1MtZetWm4V9U54FbgHmAZuLuqHk5yR5Kbe2WLwLcmOQ28HXjRzyWl3a6qqCqWlpaeX5b2qoEeHFZVp4BTfdtuX7f8V8CbhtuaJGm7vENVkhpkuEtSgwx3SWqQ4S5JDcqofhGQ5Czw+ZGcXLqw/fTdXS3tIq+pqis3KxpZuEu7VZL7q2pq1H1IF8NpGUlqkOEuSQ0y3KUXOzHqBqSL5Zy7JDXIkbskNchwl6QGGe6S1CDDXbtWktUkD677u+CjpJOcSvJNF9j/80m+cdD6bfQ7neTpvp7f2NtXSX57Xe2+JGeTfKi3/pYk7xlWL9JAj/yVRuRrVfU9gxZX1Y9tUvLzwO8AXx2wfjs+VlX/bIPtzwCTSV5eVV8DfgT4fztwfglw5K49Jsmrkjya5LW99U6St/aWP5dkf5LLk/zPJJ9K8pkkb07yc8CrgaUkS331B5MsJ3lvkoeT/FGSl/dq/nGSh5J8PMm7knzmItr/MPDjveVZoHMRx5IuyHDXbvbyvimON1fV06y9Gez9SY4C31xV7+373I3AE1X1+qqaBD5SVb/G2rt/Z6pqZoNzHQLurKrrgC8BP9Hb/pvA26rq+4HVAXr+wb6ev2PdvruAo0m+Afhu4JOD/WuQts5pGe1mG07LVNUfJ3kTcCfw+g0+92ngV5O8E/hQVX1sgHN9tqoe7C0/ABzszce/sqr+d2/77wIbTbmsd75pGarqoSQHWRu1n9qoRhoWR+7ac5K8BJgAvgZ8S//+qvpz4B+xFvL/Ocnt/TUb+Ot1y6usDXxy8d2+yEngV3FKRjvMcNde9O9Ye1n7LPC+JC9dvzPJq4GvVtXvsBak39vb9RXglYOepKq+CHwlyff1Nh292MaB9wF3VNWnh3As6bycltFu9vIkD65b/whr4fivgeur6itJPgr8IvBL6+peB7wryd8AXwd+trf9BPDhJE+eZ959I3PAe5M8A3SBpzep/8G+nv9TVX3wuZWqOgP81wHPLW2bz5aRLiDJK6pqpbd8G/BtVfVvR9yWtClH7tKF/XiSd7D238rngbeMth1pMI7c
pS1K8qPAO/s2f7aq/vko+pE2YrhLUoP8tYwkNchwl6QGGe6S1CDDXZIa9LcBDmc49mLPQQAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "data_all.boxplot(column='Existing_EMI',return_type='axes')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count    1.246260e+05\n",
       "mean     3.636342e+03\n",
       "std      3.369124e+04\n",
       "min      0.000000e+00\n",
       "25%      0.000000e+00\n",
       "50%      0.000000e+00\n",
       "75%      3.500000e+03\n",
       "max      1.000000e+07\n",
       "Name: Existing_EMI, dtype: float64"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_all['Existing_EMI'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "#该特征丢失数据较少，使用中位数进行填充\n",
    "data_all['Existing_EMI'].fillna(0, inplace=True)\n",
    "#data_all['Existing_EMI'].head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Interest Rate特征处理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   Interest_Rate  Interest_Rate_Missing\n",
      "0            NaN                      1\n",
      "1          13.25                      0\n",
      "2            NaN                      1\n",
      "3            NaN                      1\n",
      "4            NaN                      1\n",
      "5          13.99                      0\n",
      "6            NaN                      1\n",
      "7            NaN                      1\n",
      "8          14.85                      0\n",
      "9          18.25                      0\n"
     ]
    }
   ],
   "source": [
    "#样本丢失很多，这里添加一个新的特征表示是否丢失\n",
    "data_all['Interest_Rate_Missing'] = data_all['Interest_Rate'].apply(lambda x: 1 if pd.isnull(x) else 0)\n",
    "print data_all[['Interest_Rate','Interest_Rate_Missing']].head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_all.drop('Interest_Rate',axis=1,inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loan Amount和Tenure applied特征处理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "#特征样本丢失较少，用中位数填补\n",
    "data_all['Loan_Amount_Applied'].fillna(data_all['Loan_Amount_Applied'].median(),inplace=True)\n",
    "data_all['Loan_Tenure_Applied'].fillna(data_all['Loan_Tenure_Applied'].median(),inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loan Amount and Tenure selected特征处理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "#样本丢失很多，这里添加一个新的特征表示是否丢失\n",
    "data_all['Loan_Amount_Submitted_Missing'] = data_all['Loan_Amount_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)\n",
    "data_all['Loan_Tenure_Submitted_Missing'] = data_all['Loan_Tenure_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_all.drop(['Loan_Amount_Submitted','Loan_Tenure_Submitted'],axis=1,inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### logged-in特征处理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "#该参数不使用\n",
    "data_all.drop('LoggedIn',axis=1,inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ID特征处理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "#该参数不使用\n",
    "data_all.drop('ID',axis=1,inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### rocessing_Fee特征处理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "#样本丢失很多，这里添加一个新的特征表示是否丢失\n",
    "data_all['Processing_Fee_Missing'] = data_all['Processing_Fee'].apply(lambda x: 1 if pd.isnull(x) else 0)\n",
    "data_all.drop('Processing_Fee',axis=1,inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Source特征处理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    S122\n",
       "1    S122\n",
       "2    S143\n",
       "3    S143\n",
       "4    S134\n",
       "5    S143\n",
       "6    S133\n",
       "7    S159\n",
       "8    S122\n",
       "9    S133\n",
       "Name: Source, dtype: object"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_all['Source'].head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "S122      55249\n",
       "S133      42900\n",
       "others    12449\n",
       "S159       7999\n",
       "S143       6140\n",
       "Name: Source, dtype: int64"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_all['Source'] = data_all['Source'].apply(lambda x: 'others' if x not in ['S122','S133','S159','S143'] else x)\n",
    "data_all['Source'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 最终数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 124737 entries, 0 to 124736\n",
      "Data columns (total 21 columns):\n",
      "Data_source                      124737 non-null object\n",
      "Device_Type                      124737 non-null object\n",
      "Disbursed                        87020 non-null float64\n",
      "Existing_EMI                     124737 non-null float64\n",
      "Filled_Form                      124737 non-null object\n",
      "Gender                           124737 non-null object\n",
      "Loan_Amount_Applied              124737 non-null float64\n",
      "Loan_Tenure_Applied              124737 non-null float64\n",
      "Mobile_Verified                  124737 non-null object\n",
      "Monthly_Income                   124737 non-null int64\n",
      "Source                           124737 non-null object\n",
      "Var1                             124737 non-null object\n",
      "Var2                             124737 non-null object\n",
      "Var4                             124737 non-null int64\n",
      "Var5                             124737 non-null int64\n",
      "Age                              124737 non-null int64\n",
      "EMI_Loan_Submitted_Missing       124737 non-null int64\n",
      "Interest_Rate_Missing            124737 non-null int64\n",
      "Loan_Amount_Submitted_Missing    124737 non-null int64\n",
      "Loan_Tenure_Submitted_Missing    124737 non-null int64\n",
      "Processing_Fee_Missing           124737 non-null int64\n",
      "dtypes: float64(4), int64(9), object(8)\n",
      "memory usage: 20.0+ MB\n"
     ]
    }
   ],
   "source": [
    "data_all.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 独热编码One-Hot-Encoder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index([                  u'Data_source',                     u'Disbursed',\n",
       "                        u'Existing_EMI',           u'Loan_Amount_Applied',\n",
       "                 u'Loan_Tenure_Applied',                u'Monthly_Income',\n",
       "                                u'Var4',                          u'Var5',\n",
       "                                 u'Age',    u'EMI_Loan_Submitted_Missing',\n",
       "               u'Interest_Rate_Missing', u'Loan_Amount_Submitted_Missing',\n",
       "       u'Loan_Tenure_Submitted_Missing',        u'Processing_Fee_Missing',\n",
       "                       u'Device_Type_0',                 u'Device_Type_1',\n",
       "                       u'Filled_Form_0',                 u'Filled_Form_1',\n",
       "                            u'Gender_0',                      u'Gender_1',\n",
       "                              u'Var1_0',                        u'Var1_1',\n",
       "                              u'Var1_2',                        u'Var1_3',\n",
       "                              u'Var1_4',                        u'Var1_5',\n",
       "                              u'Var1_6',                        u'Var1_7',\n",
       "                              u'Var1_8',                        u'Var1_9',\n",
       "                             u'Var1_10',                       u'Var1_11',\n",
       "                             u'Var1_12',                       u'Var1_13',\n",
       "                             u'Var1_14',                       u'Var1_15',\n",
       "                             u'Var1_16',                       u'Var1_17',\n",
       "                             u'Var1_18',                        u'Var2_0',\n",
       "                              u'Var2_1',                        u'Var2_2',\n",
       "                              u'Var2_3',                        u'Var2_4',\n",
       "                              u'Var2_5',                        u'Var2_6',\n",
       "                   u'Mobile_Verified_0',             u'Mobile_Verified_1',\n",
       "                            u'Source_0',                      u'Source_1',\n",
       "                            u'Source_2',                      u'Source_3',\n",
       "                            u'Source_4'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import LabelEncoder\n",
    "le = LabelEncoder()\n",
    "var_to_encode = ['Device_Type','Filled_Form','Gender','Var1','Var2','Mobile_Verified','Source']\n",
    "# Map each categorical column to integer codes, then one-hot encode those codes.\n",
    "for col in var_to_encode:\n",
    "    data_all[col] = le.fit_transform(data_all[col])\n",
    "data_all = pd.get_dummies(data_all, columns=var_to_encode)\n",
    "data_all.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 124737 entries, 0 to 124736\n",
      "Data columns (total 53 columns):\n",
      "Data_source                      124737 non-null object\n",
      "Disbursed                        87020 non-null float64\n",
      "Existing_EMI                     124737 non-null float64\n",
      "Loan_Amount_Applied              124737 non-null float64\n",
      "Loan_Tenure_Applied              124737 non-null float64\n",
      "Monthly_Income                   124737 non-null int64\n",
      "Var4                             124737 non-null int64\n",
      "Var5                             124737 non-null int64\n",
      "Age                              124737 non-null int64\n",
      "EMI_Loan_Submitted_Missing       124737 non-null int64\n",
      "Interest_Rate_Missing            124737 non-null int64\n",
      "Loan_Amount_Submitted_Missing    124737 non-null int64\n",
      "Loan_Tenure_Submitted_Missing    124737 non-null int64\n",
      "Processing_Fee_Missing           124737 non-null int64\n",
      "Device_Type_0                    124737 non-null uint8\n",
      "Device_Type_1                    124737 non-null uint8\n",
      "Filled_Form_0                    124737 non-null uint8\n",
      "Filled_Form_1                    124737 non-null uint8\n",
      "Gender_0                         124737 non-null uint8\n",
      "Gender_1                         124737 non-null uint8\n",
      "Var1_0                           124737 non-null uint8\n",
      "Var1_1                           124737 non-null uint8\n",
      "Var1_2                           124737 non-null uint8\n",
      "Var1_3                           124737 non-null uint8\n",
      "Var1_4                           124737 non-null uint8\n",
      "Var1_5                           124737 non-null uint8\n",
      "Var1_6                           124737 non-null uint8\n",
      "Var1_7                           124737 non-null uint8\n",
      "Var1_8                           124737 non-null uint8\n",
      "Var1_9                           124737 non-null uint8\n",
      "Var1_10                          124737 non-null uint8\n",
      "Var1_11                          124737 non-null uint8\n",
      "Var1_12                          124737 non-null uint8\n",
      "Var1_13                          124737 non-null uint8\n",
      "Var1_14                          124737 non-null uint8\n",
      "Var1_15                          124737 non-null uint8\n",
      "Var1_16                          124737 non-null uint8\n",
      "Var1_17                          124737 non-null uint8\n",
      "Var1_18                          124737 non-null uint8\n",
      "Var2_0                           124737 non-null uint8\n",
      "Var2_1                           124737 non-null uint8\n",
      "Var2_2                           124737 non-null uint8\n",
      "Var2_3                           124737 non-null uint8\n",
      "Var2_4                           124737 non-null uint8\n",
      "Var2_5                           124737 non-null uint8\n",
      "Var2_6                           124737 non-null uint8\n",
      "Mobile_Verified_0                124737 non-null uint8\n",
      "Mobile_Verified_1                124737 non-null uint8\n",
      "Source_0                         124737 non-null uint8\n",
      "Source_1                         124737 non-null uint8\n",
      "Source_2                         124737 non-null uint8\n",
      "Source_3                         124737 non-null uint8\n",
      "Source_4                         124737 non-null uint8\n",
      "dtypes: float64(4), int64(9), object(1), uint8(39)\n",
      "memory usage: 18.0+ MB\n"
     ]
    }
   ],
   "source": [
    "data_all.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Splitting the training and test data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_train = data_all.loc[data_all['Data_source']=='dtrain']\n",
    "data_test = data_all.loc[data_all['Data_source']=='dtest']\n",
    "#data_train.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
   ],
   "source": [
    "# data_train and data_test are slices of data_all, so dropping columns with\n",
    "# inplace=True raises pandas' SettingWithCopyWarning; reassign the result instead.\n",
    "data_train = data_train.drop('Data_source', axis=1)\n",
    "data_test = data_test.drop(['Data_source', 'Disbursed'], axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Disbursed</th>\n",
       "      <th>Existing_EMI</th>\n",
       "      <th>Loan_Amount_Applied</th>\n",
       "      <th>Loan_Tenure_Applied</th>\n",
       "      <th>Monthly_Income</th>\n",
       "      <th>Var4</th>\n",
       "      <th>Var5</th>\n",
       "      <th>Age</th>\n",
       "      <th>EMI_Loan_Submitted_Missing</th>\n",
       "      <th>Interest_Rate_Missing</th>\n",
       "      <th>...</th>\n",
       "      <th>Var2_4</th>\n",
       "      <th>Var2_5</th>\n",
       "      <th>Var2_6</th>\n",
       "      <th>Mobile_Verified_0</th>\n",
       "      <th>Mobile_Verified_1</th>\n",
       "      <th>Source_0</th>\n",
       "      <th>Source_1</th>\n",
       "      <th>Source_2</th>\n",
       "      <th>Source_3</th>\n",
       "      <th>Source_4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>300000.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>20000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>37</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>200000.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>35000</td>\n",
       "      <td>3</td>\n",
       "      <td>13</td>\n",
       "      <td>30</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>600000.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>22500</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>34</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1000000.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>35000</td>\n",
       "      <td>3</td>\n",
       "      <td>10</td>\n",
       "      <td>28</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.0</td>\n",
       "      <td>25000.0</td>\n",
       "      <td>500000.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>100000</td>\n",
       "      <td>3</td>\n",
       "      <td>17</td>\n",
       "      <td>31</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.0</td>\n",
       "      <td>15000.0</td>\n",
       "      <td>300000.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>45000</td>\n",
       "      <td>3</td>\n",
       "      <td>17</td>\n",
       "      <td>33</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>70000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>28</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>0.0</td>\n",
       "      <td>2597.0</td>\n",
       "      <td>200000.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>20000</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>40</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>75000</td>\n",
       "      <td>5</td>\n",
       "      <td>13</td>\n",
       "      <td>43</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>300000.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>30000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>26</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>10 rows × 52 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   Disbursed  Existing_EMI  Loan_Amount_Applied  Loan_Tenure_Applied  \\\n",
       "0        0.0           0.0             300000.0                  5.0   \n",
       "1        0.0           0.0             200000.0                  2.0   \n",
       "2        0.0           0.0             600000.0                  4.0   \n",
       "3        0.0           0.0            1000000.0                  5.0   \n",
       "4        0.0       25000.0             500000.0                  2.0   \n",
       "5        0.0       15000.0             300000.0                  5.0   \n",
       "6        0.0           0.0                  6.0                  5.0   \n",
       "7        0.0        2597.0             200000.0                  5.0   \n",
       "8        0.0           0.0                  0.0                  0.0   \n",
       "9        0.0           0.0             300000.0                  3.0   \n",
       "\n",
       "   Monthly_Income  Var4  Var5  Age  EMI_Loan_Submitted_Missing  \\\n",
       "0           20000     1     0   37                           1   \n",
       "1           35000     3    13   30                           0   \n",
       "2           22500     1     0   34                           1   \n",
       "3           35000     3    10   28                           1   \n",
       "4          100000     3    17   31                           1   \n",
       "5           45000     3    17   33                           0   \n",
       "6           70000     1     0   28                           1   \n",
       "7           20000     3     3   40                           1   \n",
       "8           75000     5    13   43                           0   \n",
       "9           30000     1     0   26                           0   \n",
       "\n",
       "   Interest_Rate_Missing    ...     Var2_4  Var2_5  Var2_6  Mobile_Verified_0  \\\n",
       "0                      1    ...          0       0       1                  1   \n",
       "1                      0    ...          0       0       1                  0   \n",
       "2                      1    ...          0       0       0                  0   \n",
       "3                      1    ...          0       0       0                  0   \n",
       "4                      1    ...          0       0       0                  0   \n",
       "5                      0    ...          0       0       0                  0   \n",
       "6                      1    ...          0       0       0                  1   \n",
       "7                      1    ...          0       0       0                  0   \n",
       "8                      0    ...          0       0       0                  0   \n",
       "9                      0    ...          0       0       0                  0   \n",
       "\n",
       "   Mobile_Verified_1  Source_0  Source_1  Source_2  Source_3  Source_4  \n",
       "0                  0         1         0         0         0         0  \n",
       "1                  1         1         0         0         0         0  \n",
       "2                  1         0         0         1         0         0  \n",
       "3                  1         0         0         1         0         0  \n",
       "4                  1         0         0         0         0         1  \n",
       "5                  1         0         0         1         0         0  \n",
       "6                  0         0         1         0         0         0  \n",
       "7                  1         0         0         0         1         0  \n",
       "8                  1         1         0         0         0         0  \n",
       "9                  1         0         1         0         0         0  \n",
       "\n",
       "[10 rows x 52 columns]"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_train.to_csv('train_save.csv',index=False)\n",
    "data_test.to_csv('test_save.csv',index=False)\n",
    "data_train.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Existing_EMI</th>\n",
       "      <th>Loan_Amount_Applied</th>\n",
       "      <th>Loan_Tenure_Applied</th>\n",
       "      <th>Monthly_Income</th>\n",
       "      <th>Var4</th>\n",
       "      <th>Var5</th>\n",
       "      <th>Age</th>\n",
       "      <th>EMI_Loan_Submitted_Missing</th>\n",
       "      <th>Interest_Rate_Missing</th>\n",
       "      <th>Loan_Amount_Submitted_Missing</th>\n",
       "      <th>...</th>\n",
       "      <th>Var2_4</th>\n",
       "      <th>Var2_5</th>\n",
       "      <th>Var2_6</th>\n",
       "      <th>Mobile_Verified_0</th>\n",
       "      <th>Mobile_Verified_1</th>\n",
       "      <th>Source_0</th>\n",
       "      <th>Source_1</th>\n",
       "      <th>Source_2</th>\n",
       "      <th>Source_3</th>\n",
       "      <th>Source_4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>87020</th>\n",
       "      <td>0.0</td>\n",
       "      <td>100000.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>21500</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>28</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87021</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>42000</td>\n",
       "      <td>5</td>\n",
       "      <td>8</td>\n",
       "      <td>35</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87022</th>\n",
       "      <td>0.0</td>\n",
       "      <td>300000.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>10000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>26</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87023</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>14650</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>24</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87024</th>\n",
       "      <td>5000.0</td>\n",
       "      <td>100000.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>23400</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>28</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87025</th>\n",
       "      <td>4500.0</td>\n",
       "      <td>100000.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>15000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>29</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87026</th>\n",
       "      <td>30000.0</td>\n",
       "      <td>200000.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>69000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>43</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87027</th>\n",
       "      <td>7497.0</td>\n",
       "      <td>100000.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>20555</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>25</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87028</th>\n",
       "      <td>0.0</td>\n",
       "      <td>100000.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>30000</td>\n",
       "      <td>2</td>\n",
       "      <td>12</td>\n",
       "      <td>25</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87029</th>\n",
       "      <td>0.0</td>\n",
       "      <td>300000.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>40400</td>\n",
       "      <td>3</td>\n",
       "      <td>15</td>\n",
       "      <td>32</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>10 rows × 51 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       Existing_EMI  Loan_Amount_Applied  Loan_Tenure_Applied  Monthly_Income  \\\n",
       "87020           0.0             100000.0                  3.0           21500   \n",
       "87021           0.0                  0.0                  0.0           42000   \n",
       "87022           0.0             300000.0                  2.0           10000   \n",
       "87023           0.0                  0.0                  0.0           14650   \n",
       "87024        5000.0             100000.0                  1.0           23400   \n",
       "87025        4500.0             100000.0                  0.0           15000   \n",
       "87026       30000.0             200000.0                  5.0           69000   \n",
       "87027        7497.0             100000.0                  2.0           20555   \n",
       "87028           0.0             100000.0                  1.0           30000   \n",
       "87029           0.0             300000.0                  4.0           40400   \n",
       "\n",
       "       Var4  Var5  Age  EMI_Loan_Submitted_Missing  Interest_Rate_Missing  \\\n",
       "87020     3     3   28                           0                      0   \n",
       "87021     5     8   35                           0                      0   \n",
       "87022     1     0   26                           1                      1   \n",
       "87023     1     0   24                           1                      1   \n",
       "87024     1     0   28                           1                      1   \n",
       "87025     1     0   29                           1                      1   \n",
       "87026     1     0   43                           1                      1   \n",
       "87027     1     0   25                           1                      1   \n",
       "87028     2    12   25                           1                      1   \n",
       "87029     3    15   32                           1                      1   \n",
       "\n",
       "       Loan_Amount_Submitted_Missing    ...     Var2_4  Var2_5  Var2_6  \\\n",
       "87020                              0    ...          0       0       0   \n",
       "87021                              0    ...          0       0       0   \n",
       "87022                              1    ...          0       0       0   \n",
       "87023                              1    ...          0       0       0   \n",
       "87024                              0    ...          0       0       0   \n",
       "87025                              1    ...          0       0       0   \n",
       "87026                              1    ...          0       0       0   \n",
       "87027                              1    ...          0       0       0   \n",
       "87028                              0    ...          0       0       0   \n",
       "87029                              0    ...          0       0       0   \n",
       "\n",
       "       Mobile_Verified_0  Mobile_Verified_1  Source_0  Source_1  Source_2  \\\n",
       "87020                  0                  1         1         0         0   \n",
       "87021                  0                  1         0         1         0   \n",
       "87022                  1                  0         0         1         0   \n",
       "87023                  1                  0         0         1         0   \n",
       "87024                  0                  1         0         0         1   \n",
       "87025                  1                  0         0         1         0   \n",
       "87026                  1                  0         0         0         0   \n",
       "87027                  1                  0         0         1         0   \n",
       "87028                  0                  1         0         0         0   \n",
       "87029                  0                  1         0         1         0   \n",
       "\n",
       "       Source_3  Source_4  \n",
       "87020         0         0  \n",
       "87021         0         0  \n",
       "87022         0         0  \n",
       "87023         0         0  \n",
       "87024         0         0  \n",
       "87025         0         0  \n",
       "87026         0         1  \n",
       "87027         0         0  \n",
       "87028         0         1  \n",
       "87029         0         0  \n",
       "\n",
       "[10 rows x 51 columns]"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_test.head(10)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
